How to Scrape a Website for Data

David Benson

Web scraping is a technique for extracting data from a website. It is a useful skill to have and can save you a lot of time. You might find a site with tables of data spread over multiple pages and want that data for analysis. It's possible to copy and paste tables into Excel, but this could take ages if you are dealing with hundreds of tables. In R, with the help of a package called 'rvest', we can do this really quickly.

This guide shows you the basic procedure for web scraping, using football goalscoring data as an example (coincidentally the same data used for my data visualisation "Prolific Goalscorers in the English Top Flight (1988-2015)"). The website example for this guide is located here and has a table of the top 737 goalscorers in the English League (1988-2015) spread out over 7 pages.

The first page of these tables is shown below:

When a player is clicked on, a new page loads with a table of the number of goals that player scored in each season. It is this table we want to scrape for every player. We want the name of each player, the number of goals per season, and the season in which they scored those goals.

Scraping the Data
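The loop below relies on a `links` list of player-page URLs, which isn't built in this section. As a sketch of how such a list is typically gathered with rvest (the `base_url` and the `"table a"` CSS selector are assumptions for illustration, not the site's actual address or markup):

library(rvest)

# Hypothetical base URL for the 7 goalscorer list pages -- an assumption,
# not the real site address
base_url <- "http://example.com/goalscorers"

links <- character(0)
for (page in 1:7) {
  # Read each list page and pull the href of every link inside the table;
  # the "table a" selector is an assumption about the site's markup
  page_links <- read_html(paste0(base_url, "?page=", page)) %>%
    html_nodes("table a") %>%
    html_attr("href")
  links <- c(links, page_links)
}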

We now want to visit each page in our links list and extract the data. A useful HTML node to extract is 'table', which selects every part of a page marked up as an HTML table. We can then convert each one to a data frame with the 'html_table' function. For some reason the creator of the website has used a table for the name of each footballer, so we actually want two tables per page. Using Jimmy Greaves as an example, we would like the table "Jimmy Greaves - 357 goals" (table 1) and the table of all his goals and the seasons he scored them (table 2).

The loop below extracts this information for every footballer. We combine the information into a data frame called ‘join’.

library(rvest)

name <- list()
goals <- list()
join <- list()

for (i in seq_along(links)) {
  # Read the player's page and select every HTML table on it
  tables <- read_html(links[i]) %>%
    html_nodes('table')

  # Table 1 holds the player's name, table 2 the goals per season
  name[[i]] <- html_table(tables[[1]], fill = TRUE)
  goals[[i]] <- html_table(tables[[2]], fill = TRUE, header = FALSE)

  # The one-row name table is recycled across every row of the goals table
  join[[i]] <- data.frame(name[[i]], goals[[i]])
}

When the loop is finished, 'join' is a list of data frames. Using 'plyr' and its 'rbind.fill' function, we can bind everything into one big data frame.

library(plyr)
# rbind.fill accepts a list of data frames and binds them row-wise
allScorers <- rbind.fill(join)
str(allScorers)
## 'data.frame':    8720 obs. of  3 variables:
##  $ X1  : chr  "Jimmy Greaves - 357 goals" "Jimmy Greaves - 357 goals" "Jimmy Greaves - 357 goals" "Jimmy Greaves - 357 goals" ...
##  $ X1.1: chr  "West Ham" "1970-1971" "1969-1970" "Tottenham" ...
##  $ X2  : chr  "13" "9" "4" "220" ...

Wrapping Up

We now have a dataset of 8720 observations. This is by no means the end. The data also needs to be cleaned, as there are likely typos and other errors hidden in it. For my data visualisation, "Prolific Goalscorers in the English Top Flight (1988-2015)", I also needed to convert the data into a time series, where each row represented a year and each column a player. For now, however, the steps shown in this guide should provide you with enough knowledge to scrape a static website. For more complicated websites, you may need access to their API.
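As a sketch of that cleaning step (the column names X1, X1.1 and X2 come from the scraped data frame above; the split on " - " is an assumption that every name cell follows the "Player - N goals" pattern):

# Split "Jimmy Greaves - 357 goals" into just the player's name;
# assumes every name cell uses " - " as a separator
allScorers$player <- sapply(strsplit(allScorers$X1, " - "), `[`, 1)

# Keep only rows whose second column looks like a season, e.g. "1969-1970",
# dropping club subtotal rows such as "West Ham"
is_season <- grepl("^[0-9]{4}-[0-9]{4}$", allScorers$X1.1)
cleaned <- allScorers[is_season, c("player", "X1.1", "X2")]
names(cleaned) <- c("player", "season", "goals")
cleaned$goals <- as.numeric(cleaned$goals)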